whisper : map token timestamps to original time when VAD is enabled #3910

Open
buxuku wants to merge 1 commit into ggml-org:master from buxuku:vad-token-timestamps

buxuku commented Jun 25, 2026

Contributor

When VAD is enabled, the segment getters (whisper_full_get_segment_t0/t1) already map timestamps back to the original audio timeline, but the per-token timestamps in whisper_full_get_token_data() stay in the VAD-processed timeline with the silence removed. So if you build word-level timing on top of the token times while VAD is on, the words drift off by however much silence VAD stripped out, and there's no public getter that applies the mapping.
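
To make the size of that drift concrete, here is a small sketch that is not taken from the PR. It assumes the VAD-processed audio is simply the speech spans concatenated back to back (the real implementation may insert padding between segments), and `Span` and `removed_before` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical: one detected speech span, expressed in the original timeline.
struct Span {
    int64_t t0; // start of speech in the original audio
    int64_t t1; // end of speech in the original audio
};

// Total silence removed before speech span `i`, assuming the spans are sorted
// and the processed audio is the spans concatenated with nothing in between.
// An unmapped token time inside span `i` is early by this amount.
int64_t removed_before(const std::vector<Span> & speech, size_t i) {
    int64_t removed  = 0;
    int64_t prev_end = 0;
    for (size_t k = 0; k <= i && k < speech.size(); ++k) {
        removed += speech[k].t0 - prev_end; // silence cut before span k
        prev_end = speech[k].t1;
    }
    return removed;
}
```

With speech spans `{50, 150}` and `{400, 500}`, tokens in the first span are early by 50 time units and tokens in the second by 300, which is why the error grows as more silence is stripped.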

This adds whisper_full_get_token_t0/t1 (plus the _from_state variants) that map the token times back. A token inside a speech segment is interpolated within that segment; a token that falls in the silence removed between two segments is snapped to the nearest boundary, so it doesn't end up in the middle of a gap that isn't in the original audio. With VAD off, or when there's no segment info, the stored token times are returned unchanged, so existing callers aren't affected.
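
The mapping rule described above could look roughly like the following. This is a sketch under stated assumptions, not the code in this PR: `SpeechSegment` and `map_token_time` are hypothetical names, the segments are assumed to be sorted and non-overlapping, and "nearest boundary" is assumed to be measured in the processed timeline.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical: one speech segment with its extent in both timelines.
struct SpeechSegment {
    int64_t proc_t0; // start in the VAD-processed timeline
    int64_t proc_t1; // end in the VAD-processed timeline
    int64_t orig_t0; // start in the original audio
    int64_t orig_t1; // end in the original audio
};

// Map a token time `t` from the processed timeline to the original timeline.
int64_t map_token_time(const std::vector<SpeechSegment> & segs, int64_t t) {
    if (segs.empty()) {
        return t; // VAD off or no segment info: return the time unchanged
    }
    for (size_t i = 0; i < segs.size(); ++i) {
        const SpeechSegment & s = segs[i];
        if (t < s.proc_t0) {
            // Token falls before this segment, i.e. in a removed gap.
            if (i == 0) {
                return s.orig_t0;
            }
            const SpeechSegment & p = segs[i - 1];
            // Snap to whichever speech boundary is closer.
            return (t - p.proc_t1 <= s.proc_t0 - t) ? p.orig_t1 : s.orig_t0;
        }
        if (t <= s.proc_t1) {
            // Token is inside this segment: interpolate within it only.
            const int64_t proc_len = s.proc_t1 - s.proc_t0;
            if (proc_len <= 0) {
                return s.orig_t0;
            }
            const int64_t orig_len = s.orig_t1 - s.orig_t0;
            return s.orig_t0 + (t - s.proc_t0) * orig_len / proc_len;
        }
    }
    return segs.back().orig_t1; // past the last segment: clamp to its end
}
```

For example, with segments `{0, 100, 50, 150}` and `{110, 210, 400, 500}`, a processed time of 50 maps to 100, while 103 and 108 (both in the gap) snap to 150 and 400 respectively instead of landing somewhere between 150 and 400.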

I hit this doing word-level re-segmentation with VAD enabled: the segment times lined up with the original audio but the token times didn't. Also extended tests/test-vad-full.cpp to exercise the new getters. Built and ran it on macOS.

danbev left a comment

Member

Optional, but perhaps we could extend test-vad-full.cpp to exercise these new functions.

whisper_full_get_token_data().t0/t1 are in the VAD-processed timeline (silences
removed), so they don't line up with the original audio. Add
whisper_full_get_token_t0/t1 that map them back.

A token inside a speech segment is interpolated within it; a token that falls in a
removed inter-segment silence snaps to the nearest boundary, so it never lands in
the middle of a cut-out gap. Without VAD the raw times are returned unchanged.

buxuku force-pushed the vad-token-timestamps branch from c008fa5 to e14e08b on June 27, 2026 13:35

buxuku commented Jun 27, 2026

Contributor Author

@danbev pushed an update on top of the approved version: the token times are now mapped segment by segment instead of one interpolation over the whole mapping table, and a token that lands in a removed silence snaps to the nearest speech boundary rather than somewhere in the middle of a gap that isn't in the original audio. Also extended test-vad-full.cpp to cover the new getters as you suggested. PTAL when you have a moment, thanks!

2 participants